Fingerprint-based Similarity Search and its Applications

Authors

  • Benno Stein
  • Sven Meyer
Abstract

This paper introduces a new technology and tools from the field of text-based information retrieval. The authors have developed (1) a fingerprint-based method for highly efficient near-similarity search and (2) an application of this method to identify plagiarized passages in large document collections. The contribution of our work is twofold. First, it is a search technology that enables a new quality of comparative analysis of complex and large scientific texts. Second, this technology gives rise to a new class of tools for plagiarism analysis, since the comparison of entire books becomes computationally feasible.

The paper is organized as follows. Section 1 gives an introduction to plagiarism delicts and related detection methods, Section 2 outlines the method of fuzzy-fingerprints as a means for near-similarity search, and Section 3 shows our methods in action: it gives examples of near-similarity search as well as plagiarism detection and discusses results from a comprehensive performance analysis.

1 Plagiarism Analysis

Plagiarism is the act of claiming to be the author of material that someone else actually wrote (Encyclopædia Britannica 2005), and, with the ubiquitousness


Similar resources

Maximum Common Substructure-Based Data Fusion in Similarity Searching

Data fusion has been shown to work very well when applied to fingerprint-based similarity searching, yet little is known of its application to maximum common substructure (MCS)-based similarity searching. Two similarity search applications of the MCS will be focused on here. Typically, the number of bonds in the MCS, as well as the bonds in the two molecules being compared, are used in a simila...


Statistical modeling of value distributions of similarity coefficients in virtual screening and its application to predicting fingerprint search performance

Similarity searching using fingerprints is a popular ligand-based virtual screening approach. The Tanimoto coefficient (Tc) is the most widely used measure for quantifying fingerprint similarity. In general, it is very difficult to assess the significance of the similarity of two molecules solely based on their calculated Tc values. In the literature, Tc cut-off values are frequently intuitively...


Target enhanced 2D similarity search by using explicit biological activity annotations and profiles

Background: The enriched biological activity information of compounds in large and freely accessible chemical databases like the PubChem Bioassay Database has become a powerful resource for the scientific research community. Currently, 2D fingerprint-based conventional similarity search (CSS) is the most widely used approach for database screening, but it does not typically incor...


Fingerprint Indexing and Verification

This paper presents fingerprint indexing based on graph information of minutiae, as well as fingerprint classification and verification based on a hierarchical agglomerative clustering technique. The proposed fingerprint indexing is invariant under translation and rotation. Its performance is evaluated on several real-life datasets. The fingerprint database is clustered into five classes based on t...


Fuzzy-Fingerprints for Text-Based Information Retrieval

This paper introduces a particular form of fuzzy-fingerprints: their construction, their interpretation, and their use in the field of information retrieval. Though the concept of fingerprinting in general is not new, the way of using fingerprints within a similarity search as described here is: instead of computing the similarity between two fingerprints in order to assess the similarity between the as...




Publication date: 2010